Methods And Systems For Pushing Audiovisual Playlist Based On Text-Attentional Convolutional Neural Network

ABSTRACT

In some embodiments, methods and systems for pushing audiovisual playlists based on a text-attentional convolutional neural network include a local voice interactive terminal, a dialog system server and a playlist recommendation engine, where the dialog system server and the playlist recommendation engine are respectively connected to the local voice interactive terminal. In some embodiments, the local voice interactive terminal includes a microphone array, a host computer connected to the microphone array, and a voice synthesis chip board connected to the microphone array. In some embodiments, the playlist recommendation engine obtains rating data based on a rating predictor constructed by the neural network; the host computer parses the data into recommended playlist information; and the voice terminal synthesizes the results and pushes them to a user in the form of voice.

CROSS-REFERENCE TO RELATED APPLICATION(S)

This application claims priority benefits to Chinese Patent Application No. 202010375669.4, entitled “Methods and Systems for Pushing Audiovisual Playlist Based On Text-Attentional Convolutional Neural Network” filed with the China National Intellectual Property Administration on May 7, 2020.

The '669.4 application is incorporated herein by reference in its entirety.

FIELD OF THE INVENTION

The present disclosure relates to the technical field of smart voice interaction and machine learning (ML), and in particular to a method and system for pushing an audiovisual playlist to a user through voice interaction.

In the era of information explosion, information overload plagues people's daily lives. Even the most advanced information retrieval systems can create information overload, as the number of results they retrieve is often huge. In current interactive information recommendation systems, information is generally pushed after a user actively subscribes, and is not recommended according to the user's real concerns and/or points of interest. These passive interactive recommendation systems fail to meet the diverse needs of people in daily life. A way of improving retrieval accuracy, filtering redundant information, and/or providing users with the information they are really interested in is needed.

SUMMARY OF THE INVENTION

The present disclosure provides methods and systems for pushing an audiovisual playlist based on a text-attentional convolutional neural network (Text-CNN). Passive interactive recommendation systems offer poor interaction between a user and a viewing system and cannot provide a user with a personalized playlist push service.

In at least some embodiments, a method for pushing an audiovisual playlist based on a text-attentional convolutional neural network (Text-CNN) includes the steps of:

(A) constructing a user information database and an audiovisual information database;

(B) processing an audiovisual introduction text in the audiovisual information database by using a text digitization technique to obtain fully digital structured data, using the fully digital structured data as an input into the text-CNN, and calculating a hidden feature of the audiovisual introduction text by the following equation:

$\begin{matrix} {\left\{ \begin{matrix} {z_{w} = \tanh\left( WX_{w} + p \right),} \\ {y_{w} = Kz_{w} + q,} \end{matrix} \right.} & (1) \end{matrix}$

where, W is a feature extraction weight coefficient of an input layer of the text-CNN; K is a feature extraction weight coefficient of a hidden layer; W∈R^(n_h×(n−1)m); p∈R^(n_h); K∈R^(n_h×N); q∈R^(N); a projection layer X_(w) is a vector composed of n−1 word vectors of the input layer, with a length of (n−1)m;

calculating y_(w)={y_(w,1), y_(w,2), . . . , y_(w,N)}, then letting w_(i) represent a word in a corpus Context(w_(i)) composed of the audiovisual introduction text, and normalizing by a softmax function to obtain a similarity probability of word w_(i) in a user rating of a movie:

$\begin{matrix} {{p\left( {w❘{{Context}(w)}} \right)} = \frac{e^{y_{w,i_{w}}}}{\sum\limits_{i = 1}^{N}e^{y_{w,i}}}} & (2) \end{matrix}$

where, i_(w) represents an index of word w in corpus Context(w_(i)); y_(w,i_w) represents a probability that word w is indexed as i_(w) in corpus Context(w_(i)) when the corpus is Context(w);

letting the obtained hidden feature of an audiovisual introduction text be F in an entire convolution process, F={F₁, F₂, . . . , F_(D)} and letting F_(j) be a jth hidden feature of the audiovisual introduction, then:

F_(j)=text_cnn(W,X)  (3)

where, W is the feature extraction weight coefficient of the input layer of the text-CNN; X is a probability matrix after the digitization of the audiovisual introduction text;

(C) extracting a rating feature of probability matrix X by a convolutional layer of the text-CNN, and setting the size of a convolution window to D×L;

amplifying and extracting, by a max-pooling layer, a feature processed by the convolutional layer and affecting a user's rating into several feature maps, that is, using N one-dimensional (1D) vectors H_(N) as an input in a fully connected layer;

mapping, by the fully connected layer and an output layer, a 1D digital vector representing main feature information of a movie into a D-dimensional hidden feature matrix V of movies about user rating;

(D) counting historical initial rating information of users from the open dataset MovieLens 1M, and obtaining a digital rating matrix with values in [0,5] according to a normalization function, where N represents a user set; M represents a movie set; R_(ij) represents the rating of movie m_(j) by user u_(i); R=[R_(ij)]_(m×n) represents an overall initial rating matrix of users; decomposing R into a hidden feature matrix U∈R^(D×N) of user rating and a hidden feature matrix V∈R^(D×M) of movies;

calculating a user similarity uSim(u_(i),u_(j)), and classifying a user with a similarity greater than 0.75 as a neighboring user;

$\begin{matrix} {{{uSim}\left( {u_{i},u_{j}} \right)} = \frac{\sum_{m \in R^{M}}{\left( {r_{u_{i},m} - {\overset{\_}{r}}_{m}} \right)\left( {r_{u_{j},m} - {\overset{\_}{r}}_{m}} \right)}}{\sqrt{\sum_{m \in R^{M}}\left( {r_{u_{i},m} - {\overset{\_}{r}}_{m}} \right)^{2}}\sqrt{\sum_{m \in R^{M}}\left( {r_{u_{j},m} - {\overset{\_}{r}}_{m}} \right)^{2}}}} & (4) \end{matrix}$

where, R^(M) represents a set of movies with rating results; u_(i),u_(j) are users participating in the rating; r_(u_i,m) represents the rating of movie m by user u_(i); r̄_(m) represents a mean of the rating;

(E) subjecting the overall initial rating matrix R of users to model-based probability decomposition, where σ_(U) is a variance of a hidden feature matrix of users obtained by decomposing R_(ij); σ_(V) is a variance of a hidden feature matrix of movies obtained by decomposing R_(ij); constructing a potential rating matrix R̃=[R̃_(ij)]_(m×n) of users as a user rating predictor, R̃_(ij)=U_(i)^(T)F_(j), specifically: constructing a probability density function for the overall initial rating matrix R of users as follows:

$\begin{matrix} {\ln p\left( U,V \mid R,\sigma^{2},\sigma_{V}^{2},\sigma_{U}^{2} \right) = \sum\limits_{i = 1}^{N}\sum\limits_{j = 1}^{M} I_{ij}\ln N\left( R_{ij} \mid U_{i}^{T}V_{j},\sigma^{2} \right) + \sum\limits_{i = 1}^{N}\ln N\left( U_{i} \mid 0,\sigma_{U}^{2}I \right) + \sum\limits_{j = 1}^{M}\ln N\left( V_{j} \mid 0,\sigma_{V}^{2}I \right) + C} & (5) \end{matrix}$

where, N is a zero mean Gaussian distribution probability density function; σ is a variance of the overall initial rating matrix of users; I is a marking function regarding whether a user rates after watching a movie; C is a constant that does not depend on U or V;

iteratively updating U and V by using a gradient descent method until a loss function E converges, so as to obtain a hidden feature matrix that best represents the user and the movie in a fitting process:

$\begin{matrix} {E = \frac{1}{2}\sum\limits_{i = 1}^{N}\sum\limits_{j = 1}^{M} I_{ij}\left( R_{ij} - U_{i}^{T}F_{j} \right)^{2} + \frac{\phi^{2}}{2\phi_{U}^{2}}\sum\limits_{i = 1}^{N}\left\| U_{i} \right\|^{2} + \frac{\phi^{2}}{2\phi_{F}^{2}}\sum\limits_{j = 1}^{M}\left\| F_{j} \right\|^{2}} & (6) \end{matrix}$

where, I_(ij) is a marking function regarding whether user i participates in rating movie j; if yes, I_(ij) is 1, otherwise I_(ij) is 0; ϕ, ϕ_(U) and ϕ_(F) are regularization parameters to prevent overfitting;

using the loss function E and the gradient descent method to calculate the hidden feature matrix U of users and the hidden feature matrix V of movies:

$\begin{matrix} {\begin{matrix} {\frac{\partial E}{\partial U} = - V + \phi_{U}U} \\ {\frac{\partial E}{\partial V} = - U + \phi_{V}V} \end{matrix}} & (7) \end{matrix}$

iteratively updating to calculate the hidden feature matrix U of users and the hidden feature matrix V of movies until E converges:

U=U+ρ(V−ϕ_(U)U)

V=V+ρ(U−ϕ_(V)V)  (8)

where, ρ represents a learning rate;

(F) saving an algorithm model after training by step (E) as a model file, where the model file is called in a service program of a playlist push engine; and

(G) defining a semantic slot for a smart audiovisual playlist scene in a dialog server, and triggering an entity related to the audiovisual playlist and defined in the semantic slot to enable an audiovisual playlist recommendation function when a voice dialog with a neighboring user is conducted in the smart audiovisual playlist scene.

In at least some embodiments, the present disclosure provides systems for pushing an audiovisual playlist according to a method for pushing an audiovisual playlist based on a Text-CNN, including a local voice interactive terminal, a dialog system server and a playlist recommendation engine, where the dialog system server and the playlist recommendation engine are respectively connected to the local voice interactive terminal.

In at least some embodiments, the local voice interactive terminal includes a microphone array, a host computer and a voice synthesis chip board; the microphone array is connected to the voice synthesis chip board; the voice synthesis chip board is connected to the host computer; the host computer is connected to the dialog system server; and/or the playlist recommendation engine is connected to the voice synthesis chip board.

In at least some embodiments, the microphone array is used to collect a user's voice information and transmit the collected voice information to the host computer; the host computer processes the voice information and sends the processed voice information to the dialog system server; the dialog system server generates appropriate dialog text information through semantic matching based on the voice information sent by the host computer, and sends the dialog text information to the host computer via a transmission control protocol/Internet protocol (TCP/IP); the host computer parses the dialog text information sent by the dialog system server, and sends the parsed dialog text information to the voice synthesis chip board; and/or the voice synthesis chip board converts the dialog text information into voice information and sends the voice information to the microphone array to broadcast to the user.

In at least some embodiments, the playlist recommendation engine is used to generate audiovisual playlist information for the dialog user according to the user's dialog information, and transmit the audiovisual playlist information to the voice synthesis chip board via the TCP/IP protocol; the voice synthesis chip board generates a voice playlist push message according to the audiovisual playlist information and sends the voice playlist push message to the microphone array to broadcast to the user.

In at least some embodiments, the methods and systems for pushing an audiovisual playlist based on a Text-CNN can realize convenient interaction with users and avoid shortcomings of traditional inconvenient interactive methods such as user interfaces (UI) and manual clicks. The methods and systems can realize effective integration with other software and hardware services with voice control as the core in smart home scenes such as movies on demand, which provide users with more convenient services while satisfying users' personalized requirements for movies on demand, etc. The methods and systems can help products and/or services have a deeper understanding of users' needs based on the original basic design and timely adjust the output results.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a schematic diagram of a system for pushing an audiovisual playlist based on a Text-CNN according to some embodiments.

FIG. 2 is a schematic diagram of a Text-CNN according to some embodiments.

FIG. 3 illustrates a process of extracting feature information of an audiovisual text according to some embodiments.

FIG. 4 illustrates a decomposition process of user and movie information matrixes according to some embodiments.

FIG. 5 is a schematic diagram of a probability model introduced to the matrix decomposition process according to some embodiments.

FIG. 6 is a working process of a system for pushing an audiovisual playlist based on a Text-CNN according to some embodiments.

DETAILED DESCRIPTION OF ILLUSTRATIVE EMBODIMENT(S)

The present disclosure is described in further detail below with reference to the accompanying drawings and embodiments.

Some embodiments provide methods for pushing an audiovisual playlist based on a text-attentional convolutional neural network (Text-CNN). The methods can include the following steps:

(A) Constructing a user information database and an audiovisual information database. Specifically, a user's basic information can be recorded in a MySQL database through a user information collection module to form the user information database, and a PySpider environment can be set up to capture movie information into a MongoDB database to form the audiovisual information database.

(B) Processing an audiovisual introduction text in the audiovisual information database by using a text digitization technique to obtain fully digital structured data, using the fully digital structured data as an input into the text-CNN, and calculating a hidden feature of the audiovisual introduction text by the following equation:

$\begin{matrix} {\left\{ \begin{matrix} {{z_{w} = {\tanh\left( {{WX_{w}} + p} \right)}},} \\ {{y_{w} = {{Kz_{w}} + q}},} \end{matrix} \right..} & (1) \end{matrix}$

In the equation, W is a feature extraction weight coefficient of an input layer of the text-CNN; K is a feature extraction weight coefficient of a hidden layer; W∈R^(n_h×(n−1)m); p∈R^(n_h); K∈R^(n_h×N); q∈R^(N); a projection layer X_(w) is a vector composed of n−1 word vectors of the input layer, with a length of (n−1)m.

Calculating y_(w)={y_(w,1), y_(w,2), . . . , y_(w,N)}, then letting w_(i) represent a word in a corpus Context(w_(i)) composed of the audiovisual introduction text, and normalizing by a softmax function to obtain a similarity probability of word w_(i) in a user rating of a movie:

$\begin{matrix} {{p\left( {w❘{{Context}\;(w)}} \right)} = {\frac{e^{y_{w,i_{w}}}}{\overset{N}{\sum\limits_{i = 1}}e^{y_{w,i}}}.}} & (2) \end{matrix}$

In the equation, i_(w) represents an index of word w in corpus Context(w_(i)); y_(w,i_w) represents a probability that word w is indexed as i_(w) in corpus Context(w_(i)) when the corpus is Context(w).
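Equations (1) and (2) can be illustrated with a minimal sketch in plain Python. The toy sizes, the random initial weights, and the function name `hidden_and_softmax` are assumptions made for illustration only. The sketch stores K with N rows and n_h columns so that the product K z_w is defined, which is an assumption about how the stated dimensions are laid out.

```python
import math
import random


def hidden_and_softmax(x_w, W, p, K, q):
    """Apply Equation (1), then normalize y_w with the softmax of Equation (2)."""
    # z_w = tanh(W x_w + p): one hidden value per row of W
    z_w = [math.tanh(sum(w * x for w, x in zip(row, x_w)) + b)
           for row, b in zip(W, p)]
    # y_w = K z_w + q: one unnormalized score per word in the corpus
    y_w = [sum(k * z for k, z in zip(row, z_w)) + b
           for row, b in zip(K, q)]
    # Softmax; subtracting the maximum keeps exp() numerically stable
    shift = max(y_w)
    exps = [math.exp(y - shift) for y in y_w]
    total = sum(exps)
    return z_w, [e / total for e in exps]


# Toy sizes (assumptions): context of n-1 words, word vectors of length m,
# n_h hidden units, corpus of N words.
n, m, n_h, N = 3, 4, 5, 6
rng = random.Random(0)
x_w = [rng.uniform(-1, 1) for _ in range((n - 1) * m)]
W = [[rng.uniform(-0.5, 0.5) for _ in range((n - 1) * m)] for _ in range(n_h)]
p = [0.0] * n_h
K = [[rng.uniform(-0.5, 0.5) for _ in range(n_h)] for _ in range(N)]
q = [0.0] * N

z_w, probs = hidden_and_softmax(x_w, W, p, K, q)
```

The returned list `probs` holds one probability per corpus word, and the entry at index i_(w) corresponds to p(w | Context(w)).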

Letting the obtained hidden feature of an audiovisual introduction text be F in an entire convolution process, F={F₁, F₂, . . . , F_(D)} and letting F_(j) be a jth hidden feature of the audiovisual introduction, then:

F_(j)=text_cnn(W,X)  (3).

In the equation, W is the feature extraction weight coefficient of the input layer of the text-CNN; X is a probability matrix after the digitization of the audiovisual introduction text.

(C) Extracting a rating feature of probability matrix X by a convolutional layer of the text-CNN, and setting the size of a convolution window to D×L; amplifying and extracting, by a max-pooling layer, a feature processed by the convolutional layer and affecting a user's rating into several feature maps, that is, using N one-dimensional (1D) vectors H_(N) as an input in a fully connected layer; finally, mapping, by the fully connected layer and an output layer, a 1D digital vector representing main feature information of a movie into a D-dimensional hidden feature matrix V of movies about user rating.
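The convolution, max-pooling and fully connected stages of step (C) can be sketched as follows. This is a simplified illustration under stated assumptions, not the disclosed network: each row of X is taken to be one word vector, each filter spans L consecutive rows, a ReLU activation is assumed, pooling keeps a single maximum per filter, and all sizes, weights and function names are hypothetical.

```python
import math
import random


def conv_max_pool(X, filters, L):
    """Slide each L-row filter over X (rows are words) and keep its largest activation."""
    width = len(X[0])
    pooled = []
    for F in filters:
        activations = []
        for t in range(len(X) - L + 1):
            s = sum(F[r][c] * X[t + r][c] for r in range(L) for c in range(width))
            activations.append(max(0.0, s))  # ReLU, assumed for this sketch
        pooled.append(max(activations))      # max-pooling over positions
    return pooled


def fully_connected(h, weights, bias):
    """Map the pooled vector h to a hidden feature vector (tanh output assumed)."""
    return [math.tanh(sum(w * v for w, v in zip(row, h)) + b)
            for row, b in zip(weights, bias)]


# Toy sizes (assumptions): 7 words, 4 values per word, window of L=3 words,
# 5 filters, 2-dimensional output feature.
rng = random.Random(1)
X = [[rng.uniform(0, 1) for _ in range(4)] for _ in range(7)]
filters = [[[rng.uniform(-1, 1) for _ in range(4)] for _ in range(3)] for _ in range(5)]
fc_weights = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(2)]
fc_bias = [0.0, 0.0]

pooled = conv_max_pool(X, filters, L=3)
v_j = fully_connected(pooled, fc_weights, fc_bias)
```

Here `v_j` plays the role of one movie's column in the hidden feature matrix V.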

(D) Counting historical initial rating information of users from the open dataset MovieLens 1M, and obtaining a digital rating matrix with values in [0,5] according to a normalization function, where N represents a user set; M represents a movie set; R_(ij) represents the rating of movie m_(j) by user u_(i); R=[R_(ij)]_(m×n) represents an overall initial rating matrix of users; decomposing R into a hidden feature matrix U∈R^(D×N) of user rating and a hidden feature matrix V∈R^(D×M) of movies, where the feature matrices have D dimensions; then, calculating a user similarity, and classifying a user with a similarity greater than 0.75 as a neighboring user:

$\begin{matrix} {{{uSi{m\left( {u_{i},u_{j}} \right)}} = \frac{\sum_{m \in R^{M}}{\left( {r_{u_{i},m} - \overset{¯}{r_{m}}} \right)\left( {r_{u_{j},m} - \overset{¯}{r_{m}}} \right)}}{\sqrt{\sum_{m \in R^{M}}\left( {r_{u_{i},m} - \overset{¯}{r_{m}}} \right)^{2}}\sqrt{\sum_{m \in R^{M}}\left( {r_{u_{j},m} - \overset{¯}{r_{m}}} \right)^{2}}}}.} & (4) \end{matrix}$

In the equation, R^(M) represents a set of movies with rating results; u_(i), u_(j) are users participating in the rating; r_(u_i,m) represents the rating of movie m by user u_(i); r̄_(m) represents a mean of the rating.
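A minimal sketch of the similarity in Equation (4) and the 0.75 neighbor threshold follows. It assumes that r̄_(m) is the mean rating of movie m over the users who rated it and that the sums run over movies rated by both users; the user names, ratings and function names are invented for illustration.

```python
import math


def movie_means(ratings):
    """Mean rating of each movie over the users who rated it."""
    totals = {}
    for user_ratings in ratings.values():
        for movie, r in user_ratings.items():
            s, c = totals.get(movie, (0.0, 0))
            totals[movie] = (s + r, c + 1)
    return {movie: s / c for movie, (s, c) in totals.items()}


def u_sim(ratings, means, u_i, u_j):
    """Equation (4), summed over the movies rated by both users."""
    common = set(ratings[u_i]) & set(ratings[u_j])
    num = sum((ratings[u_i][m] - means[m]) * (ratings[u_j][m] - means[m]) for m in common)
    den_i = math.sqrt(sum((ratings[u_i][m] - means[m]) ** 2 for m in common))
    den_j = math.sqrt(sum((ratings[u_j][m] - means[m]) ** 2 for m in common))
    if den_i == 0 or den_j == 0:
        return 0.0  # no usable overlap; treated as dissimilar in this sketch
    return num / (den_i * den_j)


def neighbors(ratings, user, threshold=0.75):
    """Users whose similarity to `user` exceeds the threshold."""
    means = movie_means(ratings)
    return sorted(v for v in ratings
                  if v != user and u_sim(ratings, means, user, v) > threshold)


# Hypothetical ratings on a 0-5 scale
ratings = {
    "alice": {"m1": 5.0, "m2": 1.0, "m3": 4.0},
    "bob": {"m1": 4.5, "m2": 1.5, "m3": 4.0},
    "carol": {"m1": 1.0, "m2": 5.0, "m3": 2.0},
}
alice_neighbors = neighbors(ratings, "alice")
```

With these toy ratings, bob rates like alice and carol rates in the opposite direction, so only bob is classified as a neighboring user of alice.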

(E) Subjecting the overall initial rating matrix R of users to model-based probability decomposition, where σ_(U) is a variance of a hidden feature matrix of users obtained by decomposing R_(ij); σ_(V) is a variance of a hidden feature matrix of movies obtained by decomposing R_(ij); constructing a potential rating matrix R̃=[R̃_(ij)]_(m×n) of users as a user rating predictor, R̃_(ij)=U_(i)^(T)F_(j), specifically:

Constructing a probability density function for the overall initial rating matrix R of users as follows:

$\begin{matrix} {\ln p\left( U,V \mid R,\sigma^{2},\sigma_{V}^{2},\sigma_{U}^{2} \right) = \sum\limits_{i = 1}^{N}\sum\limits_{j = 1}^{M} I_{ij}\ln N\left( R_{ij} \mid U_{i}^{T}V_{j},\sigma^{2} \right) + \sum\limits_{i = 1}^{N}\ln N\left( U_{i} \mid 0,\sigma_{U}^{2}I \right) + \sum\limits_{j = 1}^{M}\ln N\left( V_{j} \mid 0,\sigma_{V}^{2}I \right) + C.} & (5) \end{matrix}$

In the equation, N is a zero mean Gaussian distribution probability density function; σ is a variance of the overall initial rating matrix of users; I is a marking function regarding whether a user rates after watching a movie; C is a constant that does not depend on U or V.

Iteratively updating U and V by using a gradient descent method until a loss function E converges, so as to obtain a hidden feature matrix that best represents the user and the movie in a fitting process:

$\begin{matrix} {E = \frac{1}{2}\sum\limits_{i = 1}^{N}\sum\limits_{j = 1}^{M} I_{ij}\left( R_{ij} - U_{i}^{T}F_{j} \right)^{2} + \frac{\phi^{2}}{2\phi_{U}^{2}}\sum\limits_{i = 1}^{N}\left\| U_{i} \right\|^{2} + \frac{\phi^{2}}{2\phi_{F}^{2}}\sum\limits_{j = 1}^{M}\left\| F_{j} \right\|^{2}.} & (6) \end{matrix}$

In the equation, I_(ij) is a marking function regarding whether user i participates in rating movie j; if yes, I_(ij) is 1, otherwise I_(ij) is 0; ϕ, ϕ_(U) and ϕ_(F) are regularization parameters to prevent overfitting.
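The loss of Equation (6) can be evaluated directly, as in the following sketch. The function names and the small matrices are hypothetical; U and F are stored as lists of D-dimensional vectors, one per user and one per movie.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def loss(R, I, U, F, phi, phi_U, phi_F):
    """Equation (6): masked squared error plus two regularization terms."""
    fit = 0.5 * sum(I[i][j] * (R[i][j] - dot(U[i], F[j])) ** 2
                    for i in range(len(U)) for j in range(len(F)))
    reg_u = phi ** 2 / (2 * phi_U ** 2) * sum(dot(u, u) for u in U)
    reg_f = phi ** 2 / (2 * phi_F ** 2) * sum(dot(f, f) for f in F)
    return fit + reg_u + reg_f


# Toy example (assumption): 2 users, 2 movies, D = 2
R = [[5.0, 0.0], [3.0, 4.0]]
I = [[1, 0], [1, 1]]          # user 0 has not rated movie 1
U = [[1.0, 0.0], [0.0, 1.0]]
F = [[2.0, 1.0], [0.0, 2.0]]

E = loss(R, I, U, F, phi=1.0, phi_U=1.0, phi_F=1.0)
```

In this toy case the fit term is 0.5 × (9 + 4 + 4) = 8.5, the user term is 1.0 and the movie term is 4.5, so E evaluates to 14.0. The unrated entry contributes nothing because I_(ij) is 0.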

Using the loss function E and the gradient descent method to calculate the hidden feature matrix U of users and the hidden feature matrix V of movies:

$\begin{matrix} {\begin{matrix} {\frac{\partial E}{\partial U} = - V + \phi_{U}U} \\ {\frac{\partial E}{\partial V} = - U + \phi_{V}V.} \end{matrix}} & (7) \end{matrix}$

Iteratively updating to calculate the hidden feature matrix U of users and the hidden feature matrix V of movies until E converges:

U=U+ρ(V−ϕ_(U)U)

V=V+ρ(U−ϕ_(V)V)  (8).

In the equation, ρ represents a learning rate. In this embodiment, ρ is 0.25.
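The iterative update can be sketched in plain Python. Equations (7) and (8) are written in compact matrix form; the sketch expands them, as an assumption, into per-user and per-movie gradients of a masked squared-error loss with L2 penalties. The names `lam_U` and `lam_V` are hypothetical stand-ins for the regularization weights, and the toy run uses a step size of 0.01 rather than the 0.25 of this embodiment so that the small example stays stable.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def loss(R, I, U, V, lam_U, lam_V):
    fit = sum(I[i][j] * (R[i][j] - dot(U[i], V[j])) ** 2
              for i in range(len(U)) for j in range(len(V)))
    return (0.5 * fit
            + 0.5 * lam_U * sum(dot(u, u) for u in U)
            + 0.5 * lam_V * sum(dot(v, v) for v in V))


def gradient_step(R, I, U, V, rho, lam_U, lam_V):
    """One simultaneous gradient-descent update of all user and movie vectors."""
    new_U, new_V = [], []
    for i, u in enumerate(U):
        grad = [lam_U * x for x in u]
        for j, v in enumerate(V):
            if I[i][j]:
                err = R[i][j] - dot(u, v)
                grad = [g - err * x for g, x in zip(grad, v)]
        new_U.append([x - rho * g for x, g in zip(u, grad)])
    for j, v in enumerate(V):
        grad = [lam_V * x for x in v]
        for i, u in enumerate(U):
            if I[i][j]:
                err = R[i][j] - dot(u, v)
                grad = [g - err * x for g, x in zip(grad, u)]
        new_V.append([x - rho * g for x, g in zip(v, grad)])
    return new_U, new_V


# Toy data (assumption): 3 users, 3 movies, D = 2; a rating of 0 means "not rated"
R = [[5, 3, 0], [4, 0, 1], [0, 2, 5]]
I = [[1 if r > 0 else 0 for r in row] for row in R]
rng = random.Random(0)
U = [[rng.uniform(0.1, 0.5) for _ in range(2)] for _ in R]
V = [[rng.uniform(0.1, 0.5) for _ in range(2)] for _ in R[0]]

initial = loss(R, I, U, V, 0.1, 0.1)
for _ in range(500):
    U, V = gradient_step(R, I, U, V, rho=0.01, lam_U=0.1, lam_V=0.1)
final = loss(R, I, U, V, 0.1, 0.1)
```

A production loop would stop when the change in the loss falls below a tolerance instead of running a fixed number of iterations.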

(F) Saving an algorithm model after training by step (E) as a model file. In some embodiments, a TensorFlow deep learning (DL) library is used to save the algorithm model trained in step (E) as a TensorFlow model file, which is called in a service program of a playlist push engine.
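The save-then-call pattern of step (F) can be illustrated without a deep learning library. The sketch below uses a JSON file as a stand-in for the TensorFlow model file named above, which is a deliberate simplification; the file name, the function names and the factor values are all hypothetical.

```python
import json
import os
import tempfile


def save_model(path, U, V):
    """Write the trained factor matrices to a model file (JSON stand-in)."""
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"U": U, "V": V}, fh)


def load_model(path):
    """Read the factor matrices back, as a service program would at start-up."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return data["U"], data["V"]


def predict(U, V, i, j):
    """Predicted rating of movie j by user i: the inner product of their vectors."""
    return sum(a * b for a, b in zip(U[i], V[j]))


# Hypothetical trained factors: 2 users, 2 movies, D = 2
U = [[1.0, 0.5], [0.2, 1.5]]
V = [[2.0, 1.0], [0.5, 2.0]]

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "playlist_model.json")
    save_model(model_path, U, V)
    U_loaded, V_loaded = load_model(model_path)

score = predict(U_loaded, V_loaded, 0, 1)
```

Keeping training and serving separated by a file means the playlist push engine only needs the loading and prediction code at run time.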

(G) Defining a semantic slot for a smart audiovisual playlist scene in a dialog server, and triggering an entity related to the audiovisual playlist and defined in the semantic slot to enable an audiovisual playlist recommendation function when a voice dialog with a neighboring user is conducted in the smart audiovisual playlist scene.
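One way to picture the slot-triggering logic of step (G) is a keyword match against entities defined per slot. The slot names, entity vocabularies and function names below are assumptions for illustration; a deployed dialog server would typically use its own slot definitions and a trained language-understanding component.

```python
import re

# Hypothetical slot definitions for the smart audiovisual playlist scene
SLOT_ENTITIES = {
    "content_type": {"movie", "film", "playlist"},
    "action": {"recommend", "suggest", "play"},
}


def fill_slots(utterance):
    """Return the slots whose entities appear in the utterance."""
    words = set(re.findall(r"[a-z]+", utterance.lower()))
    return {slot: sorted(words & entities)
            for slot, entities in SLOT_ENTITIES.items()
            if words & entities}


def should_recommend(utterance, is_neighboring_user):
    """Enable the recommendation function only when both slots are filled
    and the speaker was classified as a neighboring user."""
    slots = fill_slots(utterance)
    return is_neighboring_user and "content_type" in slots and "action" in slots


triggered = should_recommend("Can you recommend a movie?", True)
not_triggered = should_recommend("What is the weather today?", True)
```

The `is_neighboring_user` flag is assumed to come from the similarity threshold applied in step (D).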

Some embodiments provide a system for pushing an audiovisual playlist according to a Text-CNN-based method for pushing an audiovisual playlist to interactive user 101. In some embodiments, the systems can include local voice interactive terminal 102, dialog system server 105 and playlist recommendation engine 106, where dialog system server 105 and playlist recommendation engine 106 are respectively connected to local voice interactive terminal 102.

Local voice interactive terminal 102 can include a microphone array, a host computer and/or a voice synthesis chip board. In some embodiments, the voice synthesis chip board is connected with the host computer, and the host computer is a Linux host computer. In some of these embodiments, the microphone array is connected to the voice synthesis chip board and the host computer. The host computer can be connected to dialog system server 105 through voice interactive interface 103. The voice synthesis chip board can be connected to playlist recommendation engine 106 through a Website User Interface (WebUI) or user interface (UI) interactive interface 104 and can be used for intuitive display of a recommended playlist.

The microphone array can be used to collect a user's voice information and transmit the collected voice information to the host computer. In at least some embodiments, the host computer can process the voice information and send the processed voice information to the dialog system server.

In at least some embodiments, dialog system server 105 generates appropriate dialog text information through semantic matching based on the voice information sent by the host computer, and sends the dialog text information to the host computer via a transmission control protocol/Internet protocol (TCP/IP). The host computer parses the dialog text information sent by dialog system server 105 and sends the parsed dialog text information to the voice synthesis chip board. The voice synthesis chip board can convert the dialog text information into voice information and send the voice information to the microphone array to broadcast to the user.

In at least some embodiments, playlist recommendation engine 106 is used to generate audiovisual playlist information for the dialog user according to the user's dialog information, and transmit the audiovisual playlist information to the voice synthesis chip board via the TCP/IP protocol. The voice synthesis chip board can generate a voice playlist push message according to the audiovisual playlist information and send the voice playlist push message to the microphone array to broadcast to the user.
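The transfer of playlist information to the terminal can be sketched as a framed message exchange over a stream socket. The message format (a 4-byte length prefix followed by JSON), the field names and the movie titles are assumptions, and `socket.socketpair()` is used as an in-process stand-in for a TCP connection between the engine and the terminal.

```python
import json
import socket
import struct


def send_message(sock, payload):
    """Send one JSON message prefixed with its length in bytes."""
    body = json.dumps(payload).encode("utf-8")
    sock.sendall(struct.pack("!I", len(body)) + body)


def recv_exact(sock, size):
    """Read exactly `size` bytes, since a stream socket may return partial data."""
    chunks = []
    while size > 0:
        chunk = sock.recv(size)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        size -= len(chunk)
    return b"".join(chunks)


def recv_message(sock):
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return json.loads(recv_exact(sock, length).decode("utf-8"))


engine_side, terminal_side = socket.socketpair()
try:
    send_message(engine_side, {"type": "playlist_push", "titles": ["Movie A", "Movie B"]})
    received = recv_message(terminal_side)
finally:
    engine_side.close()
    terminal_side.close()

# Text that the voice synthesis step would turn into speech
speech_text = "Recommended for you: " + ", ".join(received["titles"])
```

Length-prefixed framing is one common choice because a stream protocol such as TCP does not preserve message boundaries on its own.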

In at least some embodiments, the methods and systems for pushing an audiovisual playlist based on a Text-CNN in the present disclosure realize convenient interaction with users and avoid the shortcomings of traditional inconvenient interactive methods such as UI and manual click. In at least some embodiments, the present disclosure realizes effective integration with other software and hardware services with voice control as the core in smart home scenes such as movies on demand, which provides users with more convenient services while satisfying users' personalized requirements for movies on demand, etc. The present disclosure can help products or services to have a deeper understanding of user needs based on the original basic design and timely adjust the output results.

It should be noted that the above embodiments are only intended to explain, rather than to limit the technical solutions of the present disclosure. Although the present disclosure is described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications or equivalent substitutions may be made to the technical solutions of the present disclosure without departing from the spirit and scope of the technical solutions of the present disclosure, and such modifications or equivalent substitutions should be included within the scope of the claims of the present disclosure. 

What is claimed is:
 1. A method for pushing an audiovisual playlist based on a text-attentional convolutional neural network comprising:

(A) constructing a user information database and an audiovisual information database;

(B) processing an audiovisual introduction text in said audiovisual information database comprising (i) using a text digitization technique to obtain fully digital structured data; (ii) using said fully digital structured data as an input into said text-attentional convolutional neural network, and (iii) calculating a hidden feature of said audiovisual introduction text by a first equation:

$\left\{ \begin{matrix} {z_{w} = \tanh\left( WX_{w} + p \right),} \\ {y_{w} = Kz_{w} + q,} \end{matrix} \right.$

wherein, W is a feature extraction weight coefficient of an input layer of said text-attentional convolutional neural network; K is a feature extraction weight coefficient of a hidden layer; W∈R^(n_h×(n−1)m); p∈R^(n_h); K∈R^(n_h×N); q∈R^(N); and a projection layer X_(w) is a vector composed of n−1 word vectors of the input layer, with a length of (n−1)m; (iv) calculating y_(w)={y_(w,1), y_(w,2), . . . , y_(w,N)}, and letting w_(i) represent a word in a corpus Context(w_(i)) composed of said audiovisual introduction text, and normalizing by a softmax function to obtain a similarity probability of word w_(i) in a user rating of a movie:

$p\left( w \mid \mathrm{Context}(w) \right) = \frac{e^{y_{w,i_{w}}}}{\sum\limits_{i = 1}^{N}e^{y_{w,i}}}$

wherein, i_(w) represents an index of word w in said corpus Context(w_(i)); y_(w,i_w) represents a probability that word w is indexed as i_(w) in said corpus Context(w_(i)) when said corpus is Context(w); (v) letting said hidden feature of said audiovisual introduction text be F in an entire convolution process, F={F₁, F₂, . . . , F_(D)}, and letting F_(j) be a jth hidden feature of said audiovisual introduction, then:

F_(j)=text_cnn(W,X)

wherein, W is the feature extraction weight coefficient of the input layer of said text-attentional convolutional neural network; X is a probability matrix after digitization of the audiovisual introduction text;

(C) extracting a rating feature of probability matrix X by a convolutional layer of said text-attentional convolutional neural network; setting a size of a convolution window to D×L; amplifying and extracting, by a max-pooling layer, a feature processed by the convolutional layer and affecting a user's rating into several feature maps, that is, using N one-dimensional (1D) vectors H_(N) as an input in a fully connected layer; and mapping, by the fully connected layer and an output layer, a 1D digital vector representing main feature information of a movie into a D-dimensional hidden feature matrix V of movies about user rating;

(D) counting historical initial rating information of users from the open dataset MovieLens 1M, and obtaining a digital rating matrix with values in [0,5] according to a normalization function, wherein N represents a user set; M represents a movie set; R_(ij) represents the rating of movie m_(j) by user u_(i); R=[R_(ij)]_(m×n) represents an overall initial rating matrix of users; decomposing R into a hidden feature matrix U∈R^(D×N) of user rating and a hidden feature matrix V∈R^(D×M) of movies; then, calculating a user similarity uSim(u_(i), u_(j)), and classifying a user with a similarity greater than 0.75 as a neighboring user;

$uSim\left( u_{i},u_{j} \right) = \frac{\sum_{m \in R^{M}}\left( r_{u_{i},m} - \overline{r_{m}} \right)\left( r_{u_{j},m} - \overline{r_{m}} \right)}{\sqrt{\sum_{m \in R^{M}}\left( r_{u_{i},m} - \overline{r_{m}} \right)^{2}}\sqrt{\sum_{m \in R^{M}}\left( r_{u_{j},m} - \overline{r_{m}} \right)^{2}}}$

wherein, R^(M) represents a set of movies with rating results; u_(i), u_(j) are users participating in the rating; r_(u_i,m) represents the rating of movie m by user u_(i); r̄_(m) represents a mean of the rating;

(E) subjecting the overall initial rating matrix R of users to model-based probability decomposition, wherein σ_(U) is a variance of a hidden feature matrix of users obtained by decomposing R_(ij); σ_(V) is a variance of a hidden feature matrix of movies obtained by decomposing R_(ij); constructing a potential rating matrix R̃=[R̃_(ij)]_(m×n) of users as a user rating predictor, R̃_(ij)=U_(i)^(T)F_(j), constructing a probability density function for the overall initial rating matrix R of users as follows:

$\ln p\left( U,V \mid R,\sigma^{2},\sigma_{V}^{2},\sigma_{U}^{2} \right) = \sum\limits_{i = 1}^{N}\sum\limits_{j = 1}^{M} I_{ij}\ln N\left( R_{ij} \mid U_{i}^{T}V_{j},\sigma^{2} \right) + \sum\limits_{i = 1}^{N}\ln N\left( U_{i} \mid 0,\sigma_{U}^{2}I \right) + \sum\limits_{j = 1}^{M}\ln N\left( V_{j} \mid 0,\sigma_{V}^{2}I \right) + C$

wherein, N is a zero mean Gaussian distribution probability density function; σ is a variance of the overall initial rating matrix of users; I is a marking function regarding whether a user rates after watching said movie; C is a constant that does not depend on U or V; iteratively updating U and V by using a gradient descent method until a loss function E converges, so as to obtain a hidden feature matrix that best represents the user and the movie in a fitting process:

$E = \frac{1}{2}\sum\limits_{i = 1}^{N}\sum\limits_{j = 1}^{M} I_{ij}\left( R_{ij} - U_{i}^{T}F_{j} \right)^{2} + \frac{\phi^{2}}{2\phi_{U}^{2}}\sum\limits_{i = 1}^{N}\left\| U_{i} \right\|^{2} + \frac{\phi^{2}}{2\phi_{F}^{2}}\sum\limits_{j = 1}^{M}\left\| F_{j} \right\|^{2}$

wherein, I_(ij) is a marking function regarding whether user i participates in rating movie j; if yes, I_(ij) is 1, otherwise I_(ij) is 0; ϕ, ϕ_(U) and ϕ_(F) are regularization parameters to prevent overfitting; using said loss function E and the gradient descent method to calculate a hidden feature matrix U of users and a hidden feature matrix V of movies:

$\frac{\partial E}{\partial U} = - V + \phi_{U}U$

$\frac{\partial E}{\partial V} = - U + \phi_{V}V$

iteratively updating to calculate said hidden feature matrix U of users and said hidden feature matrix V of movies until E converges:

U=U+ρ(V−ϕ_(U)U)

V=V+ρ(U−ϕ_(V)V)

wherein, ρ represents a learning rate;

(F) saving an algorithm model after training by step (E) as a model file, wherein the model file is called in a service program of a playlist push engine; and

(G) defining a semantic slot for a smart audiovisual playlist scene in a dialog server, and triggering an entity related to the audiovisual playlist and defined in the semantic slot to enable an audiovisual playlist recommendation function when a voice dialog with said neighboring user is conducted in the smart audiovisual playlist scene.
 2. A system for pushing an audiovisual playlist according to a method for pushing said audiovisual playlist based on a text-attentional convolutional neural network comprising: (A) a local voice interactive terminal comprising: (i) a microphone array; (ii) a host computer; and (iii) a voice synthesis chip board; (B) a dialog system server; and (C) a playlist recommendation engine, wherein said dialog system server and said playlist recommendation engine are respectively connected to said local voice interactive terminal; wherein said microphone array is connected to said voice synthesis chip board; wherein said voice synthesis chip board is connected to said host computer; wherein said host computer is connected to said dialog system server; and wherein said playlist recommendation engine is connected to said voice synthesis chip board.
 3. The system of claim 2 wherein said microphone array is used to collect a user's voice information and transmit said user's voice information to said host computer; wherein said host computer processes said user's voice information and sends the processed user's voice information to said dialog system server; said dialog system server generates dialog text information through semantic matching based on said processed user's voice information and sends said dialog text information to said host computer via a transmission control protocol/Internet protocol (TCP/IP); said host computer parses said dialog text information and sends the parsed dialog text information to said voice synthesis chip board; said voice synthesis chip board converts said parsed dialog text information into voice information and sends said voice information to said microphone array to broadcast to said user; said playlist recommendation engine generates audiovisual playlist information for said dialog user according to said user's dialog information and transmits said audiovisual playlist information to said voice synthesis chip board via said TCP/IP protocol; and said voice synthesis chip board generates a voice playlist push message according to said audiovisual playlist information and sends said voice playlist push message to said microphone array to broadcast to said user.